Sequence Labeling

Core Concept

Sequence labeling is the task of assigning a label to each element of an input sequence so that the output is a sequence of labels of the same length. The input is typically a sequence of tokens (e.g. words, characters, or time steps) and the output is a corresponding sequence of tags (e.g. part-of-speech, named-entity type, chunk type). Dependencies between labels are often important—adjacent or nearby tags tend to follow valid patterns (e.g. IOB constraints in NER)—so models usually capture local or long-range structure rather than predicting each position independently. This makes sequence labeling a core structured-prediction problem with applications in NLP, speech, and time-series analysis.
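
For concreteness, a minimal example of the input/output pairing (the sentence, tag set, and BIO labels are purely illustrative):

    # One input token sequence, one output tag sequence of equal length.
    # Tags use the BIO scheme: B- begins an entity span, I- continues it, O is outside.
    tokens = ["Barack", "Obama", "visited", "Paris", "in", "2009", "."]
    tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "O",  "O",    "O"]

    assert len(tokens) == len(tags)  # sequence labeling: same length in and out
    for tok, tag in zip(tokens, tags):
        print(f"{tok}\t{tag}")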

Key Characteristics

  • One label per position – Output length equals input length; each token (or frame) gets exactly one label. This distinguishes sequence labeling from segmentation, where boundaries are predicted, though encoding schemes such as BIO merge the two (each label encodes both segment type and position).
  • Label dependencies – Labels are not independent: valid tag sequences obey constraints (e.g. in BIO, I-X must follow B-X or I-X of the same type; grammatical tag sequences). Models use Markov assumptions (e.g. bigram), linear-chain CRFs, or recurrent/attention networks to capture these dependencies.
  • Decoding – Finding the best label sequence under the model is done with Viterbi (for linear-chain factorized scores) or beam search (common for neural models); exact decoding is tractable for chain-structured factors (see the sketch after this list).
  • Training signal – Supervision is usually full label sequences; training maximizes joint likelihood (generative HMM) or conditional likelihood (discriminative CRF, neural models), or minimizes a margin-based/structured hinge loss.
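
A minimal NumPy sketch of Viterbi decoding over chain-factorized scores; array names and shapes are illustrative, and scores are assumed to be in log space so they can be added:

    import numpy as np

    def viterbi(emission, transition, start):
        """Exact decoding for a linear chain.

        emission:   (T, K) per-position score of each of K labels (log-space)
        transition: (K, K) score of moving from label i to label j (log-space)
        start:      (K,)   score of starting in each label (log-space)
        Returns the highest-scoring label sequence as a list of label indices.
        """
        T, K = emission.shape
        score = start + emission[0]            # best score ending in each label at t=0
        backptr = np.zeros((T, K), dtype=int)  # argmax predecessors for backtracking

        for t in range(1, T):
            # candidate[i, j] = best path ending in i at t-1, then i -> j, then emit at t
            candidate = score[:, None] + transition + emission[t][None, :]
            backptr[t] = candidate.argmax(axis=0)
            score = candidate.max(axis=0)

        # Follow back-pointers from the best final label.
        best = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return best[::-1]

With K labels and T positions the dynamic program runs in O(T·K²) time, which is why exact decoding remains tractable for linear chains.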

Common Applications

  • Part-of-speech (POS) tagging – Assigning a grammatical category (noun, verb, etc.) to each word in a sentence
  • Named entity recognition (NER) – Identifying and typing spans such as person, organization, location (often with BIO/BIOES encoding)
  • Chunking (shallow parsing) – Labeling segments such as noun phrases and verb phrases
  • Speech recognition (frame-level) – Labeling each frame or segment with a phone or state
  • Biological sequence annotation – Labeling residues or bases with function or structure
  • Gesture and activity recognition – Labeling each time step with a gesture or activity class

Sequence Labeling Algorithms

Sequence labeling algorithms differ in how they model label dependencies (generative vs discriminative, locally vs globally normalized), whether they use handcrafted features or learned representations, and how they decode (exact Viterbi vs beam search). The choice depends on data size, the need for interpretability, and whether rich contextual representations (e.g. from pre-trained transformers) are required.

  • Hidden Markov Model (HMM) – Generative model: transition probabilities between hidden states (labels) and emission probabilities from state to observation (token). Decoding with Viterbi; training with Baum–Welch (EM) or supervised MLE. Interpretable and tractable; limited by Markov assumption and local features.
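
A toy sketch of supervised MLE training for an HMM tagger (count and normalize; smoothing and unknown-word handling are omitted):

    from collections import Counter, defaultdict

    def train_hmm(tagged_sentences):
        """Supervised MLE for an HMM: count and normalize transitions and emissions.

        tagged_sentences: list of [(token, tag), ...] sequences (toy, unsmoothed).
        """
        transitions = defaultdict(Counter)  # transitions[prev_tag][tag]
        emissions = defaultdict(Counter)    # emissions[tag][token]
        starts = Counter()

        for sent in tagged_sentences:
            prev = None
            for token, tag in sent:
                emissions[tag][token] += 1
                if prev is None:
                    starts[tag] += 1
                else:
                    transitions[prev][tag] += 1
                prev = tag

        def normalize(counter):
            total = sum(counter.values())
            return {k: v / total for k, v in counter.items()}

        return (normalize(starts),
                {t: normalize(c) for t, c in transitions.items()},
                {t: normalize(c) for t, c in emissions.items()})

Decoding would then run Viterbi over the log transition and emission probabilities, as in the earlier sketch.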

  • Maximum Entropy Markov Model (MEMM) – Discriminative: per-position classifier (e.g. MaxEnt) conditioned on previous label; trained to maximize conditional likelihood of the label sequence. Decoding with Viterbi; avoids modeling observation distribution but susceptible to label bias.
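
A toy MEMM-style sketch, assuming scikit-learn is available; the feature template, the "<s>" start symbol, and greedy decoding (rather than full Viterbi over the per-position conditionals) are simplifications for illustration:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(tokens, i, prev_tag):
        # Illustrative feature template: current word, suffix, and the previous tag.
        return {"word": tokens[i].lower(),
                "suffix3": tokens[i][-3:].lower(),
                "prev_tag": prev_tag}

    def train_memm(tagged_sentences):
        X, y = [], []
        for sent in tagged_sentences:
            tokens = [tok for tok, _ in sent]
            prev = "<s>"
            for i, (_, tag) in enumerate(sent):
                X.append(features(tokens, i, prev))
                y.append(tag)
                prev = tag
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
        return vec, clf

    def greedy_decode(vec, clf, tokens):
        # Greedy left-to-right tagging; a full MEMM would run Viterbi instead.
        tags, prev = [], "<s>"
        for i in range(len(tokens)):
            tag = clf.predict(vec.transform([features(tokens, i, prev)]))[0]
            tags.append(tag)
            prev = tag
        return tags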

  • Conditional Random Field (CRF) – Discriminative: globally normalized distribution over label sequences; factors over edges (and optionally nodes) in a linear chain. Trained with conditional MLE; decoding with Viterbi. Handles feature-rich inputs and avoids label bias; widely used as a layer on top of feature or representation extractors.
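
A minimal sketch using the sklearn-crfsuite package (assumed installed); the feature template and the tiny training set are illustrative only:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def word_features(tokens, i):
        # Feature-rich template: the CRF can condition freely on the whole observation.
        feats = {"word.lower": tokens[i].lower(),
                 "word.istitle": tokens[i].istitle(),
                 "suffix3": tokens[i][-3:].lower(),
                 "BOS": i == 0, "EOS": i == len(tokens) - 1}
        if i > 0:
            feats["prev_word.lower"] = tokens[i - 1].lower()
        return feats

    # X: one list of per-token feature dicts per sentence; y: the matching tag sequences.
    train_sents = [["John", "lives", "in", "Berlin"]]
    train_tags = [["B-PER", "O", "O", "B-LOC"]]
    X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
    y = train_tags

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    print(crf.predict(X))  # Viterbi-decoded tag sequences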

  • BiLSTM-CRF – Bidirectional LSTM encodes the input sequence; CRF layer on top models label dependencies and is decoded with Viterbi. Combines neural representation with globally normalized label structure; standard baseline for NER and POS.
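
A condensed PyTorch sketch, assuming the pytorch-crf package provides the CRF layer; dimensions are arbitrary and the training loop is omitted:

    import torch.nn as nn
    from torchcrf import CRF  # pip install pytorch-crf

    class BiLSTMCRF(nn.Module):
        """BiLSTM encoder producing per-token emission scores, CRF layer on top."""

        def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                                bidirectional=True)
            self.emit = nn.Linear(hidden_dim, num_tags)  # per-token emission scores
            self.crf = CRF(num_tags, batch_first=True)   # transition scores + Viterbi

        def loss(self, token_ids, tags, mask):
            emissions = self.emit(self.lstm(self.embed(token_ids))[0])
            return -self.crf(emissions, tags, mask=mask, reduction="mean")

        def predict(self, token_ids, mask):
            emissions = self.emit(self.lstm(self.embed(token_ids))[0])
            return self.crf.decode(emissions, mask=mask)  # Viterbi decoding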

  • BiLSTM (no CRF) – Bidirectional LSTM with a softmax (or other classifier) per position; each position is predicted independently given its hidden state. No explicit modeling of label dependencies; simpler and faster than adding a CRF layer, but often slightly worse on strict span-level metrics.
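
A corresponding per-position sketch (hypothetical dimensions): the same encoder, but with an independent classifier at each step trained with token-level cross-entropy:

    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        """BiLSTM encoder with an independent softmax classifier per position."""

        def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(hidden_dim, num_tags)
            # Per-token cross-entropy; padded positions are labeled -100 and ignored.
            self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

        def forward(self, token_ids):
            return self.out(self.lstm(self.embed(token_ids))[0])  # (B, T, num_tags)

        def loss(self, token_ids, tags):
            logits = self.forward(token_ids)
            return self.loss_fn(logits.flatten(0, 1), tags.flatten())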

  • Transformer / BERT for sequence labeling – Pre-trained transformer (e.g. BERT) encodes the sequence; a linear or small MLP head on top of each token representation predicts the label. Fine-tuned with cross-entropy (per token). Captures long-range context; state-of-the-art for NER and POS when data and compute allow.
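
A minimal inference-side sketch with the Hugging Face transformers library (assumed installed); the label set is hypothetical, and the freshly initialized classification head would still need fine-tuning before its predictions mean anything:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical tag set

    name = "bert-base-cased"  # any pre-trained encoder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

    tokens = ["Barack", "Obama", "visited", "Paris"]
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

    with torch.no_grad():
        logits = model(**enc).logits          # (1, num_subwords, num_labels)
    pred = logits.argmax(dim=-1)[0].tolist()  # one label id per subword token

    # Map subword predictions back to words (label of each word's first subword).
    word_ids = enc.word_ids()
    word_preds = {}
    for idx, wid in enumerate(word_ids):
        if wid is not None and wid not in word_preds:
            word_preds[wid] = labels[pred[idx]]
    print([word_preds[i] for i in range(len(tokens))])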